Project: Autshumato III

Type: Monolingual corpus
Language: Setswana (tn_ZA, tsn_ZA)
Date: 2016-05-04
Version: 1.0.0 (Final)

Description: 
Setswana monolingual corpus, a deliverable of the Autshumato project.
The data is provided as a UTF-8 text file, with one sentence per line.
The data is tokenised: spaces have been inserted between punctuation marks and words.

Monolingual Data Counts
Lines: 38 205
Setswana words: 879 248
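
The counts above can be checked against a local copy of the file with a short
script. The following is a minimal sketch, not the method used to produce the
official figures: the function names are hypothetical, and it assumes that a
"word" is any whitespace-separated token. Because the data is tokenised, that
assumption also counts punctuation marks as tokens, so the result may differ
from the word count reported here.

```python
def count_corpus(lines):
    """Return (non-empty line count, whitespace-separated token count)."""
    n_lines = 0
    n_tokens = 0
    for line in lines:
        tokens = line.split()  # the corpus is already tokenised with spaces
        if tokens:             # skip blank lines
            n_lines += 1
            n_tokens += len(tokens)
    return n_lines, n_tokens


def count_corpus_file(path):
    """Open a UTF-8 corpus file (path supplied by the caller) and count it."""
    with open(path, encoding="utf-8") as handle:
        return count_corpus(handle)
```

For example, count_corpus(["Dumela , lefatshe .\n"]) returns (1, 4) for this
invented one-sentence sample, since the comma and full stop are separate tokens.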


Source(s):
Various sources, predominantly government domain.

Project website: http://autshumato.sourceforge.net/
_________________________________________________________________________________
Licence: Creative Commons Attribution 2.5 South Africa
 
URL: http://creativecommons.org/licenses/by/2.5/za/
 
Attribute work to: 
	CTexT (Centre for Text Technology, North-West University), South Africa; 
	Department of Arts and Culture, South Africa.
Attribute work to URL:	
	http://www.nwu.ac.za/ctext and 
	http://www.dac.gov.za/